Yan CHEN Jing ZHANG Yuebing XU Yingjie ZHANG Renyuan ZHANG Yasuhiko NAKASHIMA
An efficient resistive random access memory (ReRAM) structure is developed for accelerating convolutional neural network (CNN) powered by the in-memory computation. A novel ReRAM cell circuit is designed with two-directional (2-D) accessibility. The entire memory system is organized as a 2-D array, in which specific memory cells can be identically accessed by both of column- and row-locality. For the in-memory computations of CNNs, only relevant cells in an identical sub-array are accessed by 2-D read-out operations, which is hardly implemented by conventional ReRAM cells. In this manner, the redundant access (column or row) of the conventional ReRAM structures is prevented to eliminated the unnecessary data movement when CNNs are processed in-memory. From the simulation results, the energy and bandwidth efficiency of the proposed memory structure are 1.4x and 5x of a state-of-the-art ReRAM architecture, respectively.
To minimize the annotation costs associated with training semantic segmentation models and object detection models, weakly supervised detection and weakly supervised segmentation approaches have been extensively studied. However most of these approaches assume that the domain between training and testing is the same, which at times results in considerable performance drops. For example, if we train an object detection network using only web images showing a large object at the center, it can be difficult for the network to detect multiple small objects. In this paper, we focus on training a CNN with only web images and achieve object detection in the wild. A proposal-based approach can address the problem associated with differences in domains because web images are similar to images of the proposal. In both domains, the target object is located at the center of the image and the ratio of the size of the target object to the size of the image is large. Several proposal methods have been proposed to detect regions with high “object-ness.” However, many of these proposals generate a large number of candidates to increase the recall rate. Considering the recent advent of deep CNNs, methods that generate a large number of proposals exhibit problems in terms of processing time for practical use. Therefore, we propose a CNN-based “food-ness” proposal method in this paper that requires neither pixel-wise annotation nor bounding box annotation. Our method generates proposals through backpropagation and most of these proposals focus only on food objects. In addition, we can easily control the number of proposals. Through experiments, we trained a network model using only web images and tested the model on the UEC FOOD 100 dataset. We demonstrate that the proposed method achieves high performance compared to traditional proposal methods in terms of the trade-off between accuracy and computational cost. Therefore, in this paper, we propose an intermediate approach between the traditional proposal approach and the fully convolutional approach. In particular, we propose a novel proposal method that generates high“food-ness” regions using fully convolutional networks based on the backward approach by training food images gathered from the web.
Ming LI Li SHI Xudong CHEN Sidan DU Yang LI
The large computational complexity makes stereo matching a big challenge in real-time application scenario. The problem of stereo matching in a video sequence is slightly different with that in a still image because there exists temporal correlation among video frames. However, no existing method considered temporal consistency of disparity for algorithm acceleration. In this work, we proposed a scheme called the dynamic disparity range (DDR) to optimize matching cost calculation and cost aggregation steps by narrowing disparity searching range, and a scheme called temporal cost aggregation path to optimize the cost aggregation step. Based on the schemes, we proposed the DDR-SGM and the DDR-MCCNN algorithms for the stereo matching in video sequences. Evaluation results showed that the proposed algorithms significantly reduced the computational complexity with only very slight loss of accuracy. We proved that the proposed optimizations for the stereo matching are effective and the temporal consistency in stereo video is highly useful for either improving accuracy or reducing computational complexity.
Kai TAN Qingbo WU Fanman MENG Linfeng XU
Saliency quality assessment aims at estimating the objective quality of a saliency map without access to the ground-truth. Existing works typically evaluate saliency quality by utilizing information from saliency maps to assess its compactness and closedness while ignoring the information from image content which can be used to assess the consistence and completeness of foreground. In this letter, we propose a novel multi-information fusion network to capture the information from both the saliency map and image content. The key idea is to introduce a siamese module to collect information from foreground and background, aiming to assess the consistence and completeness of foreground and the difference between foreground and background. Experiments demonstrate that by incorporating image content information, the performance of the proposed method is significantly boosted. Furthermore, we validate our method on two applications: saliency detection and segmentation. Our method is utilized to choose optimal saliency map from a set of candidate saliency maps, and the selected saliency map is feeded into an segmentation algorithm to generate a segmentation map. Experimental results verify the effectiveness of our method.
Masayuki SHIMODA Shimpei SATO Hiroki NAKAHARA
We propose an object detector using a sliding window method for an event-driven camera which outputs a subtracted frame (usually a binary value) when changes are detected in captured images. Since sliding window skips unchanged portions of the output, the number of target object area candidates decreases dramatically, which means that our system operates faster and with lower power consumption than a system using a straightforward sliding window approach. Since the event-driven camera output consists of binary precision frames, an all binarized convolutional neural network (ABCNN) can be available, which means that it allows all convolutional layers to share the same binarized convolutional circuit, thereby reducing the area requirement. We implemented our proposed method on the Xilinx Inc. Zedboard and then evaluated it using the PETS 2009 dataset. The results showed that our system outperformed BCNN system from the viewpoint of detection performance, hardware requirement, and computation time. Also, we showed that FPGA is an ideal method for our system than mobile GPU. From these results, our proposed system is more suitable for the embedded systems based on stationary cameras (such as security cameras).
Ruicong ZHI Hairui XU Ming WAN Tingting LI
Facial micro-expression is momentary and subtle facial reactions, and it is still challenging to automatically recognize facial micro-expression with high accuracy in practical applications. Extracting spatiotemporal features from facial image sequences is essential for facial micro-expression recognition. In this paper, we employed 3D Convolutional Neural Networks (3D-CNNs) for self-learning feature extraction to represent facial micro-expression effectively, since the 3D-CNNs could well extract the spatiotemporal features from facial image sequences. Moreover, transfer learning was utilized to deal with the problem of insufficient samples in the facial micro-expression database. We primarily pre-trained the 3D-CNNs on normal facial expression database Oulu-CASIA by supervised learning, then the pre-trained model was effectively transferred to the target domain, which was the facial micro-expression recognition task. The proposed method was evaluated on two available facial micro-expression datasets, i.e. CASME II and SMIC-HS. We obtained the overall accuracy of 97.6% on CASME II, and 97.4% on SMIC, which were 3.4% and 1.6% higher than the 3D-CNNs model without transfer learning, respectively. And the experimental results demonstrated that our method achieved superior performance compared to state-of-the-art methods.
Toshiki SHIBAHARA Takeshi YAGI Mitsuaki AKIYAMA Daiki CHIBA Kunio HATO
Malware-infected hosts have typically been detected using network-based Intrusion Detection Systems on the basis of characteristic patterns of HTTP requests collected with dynamic malware analysis. Since attackers continuously modify malicious HTTP requests to evade detection, novel HTTP requests sent from new malware samples need to be exhaustively collected in order to maintain a high detection rate. However, analyzing all new malware samples for a long period is infeasible in a limited amount of time. Therefore, we propose a system for efficiently collecting HTTP requests with dynamic malware analysis. Specifically, our system analyzes a malware sample for a short period and then determines whether the analysis should be continued or suspended. Our system identifies malware samples whose analyses should be continued on the basis of the network behavior in their short-period analyses. To make an accurate determination, we focus on the fact that malware communications resemble natural language from the viewpoint of data structure. We apply the recursive neural network, which has recently exhibited high classification performance in the field of natural language processing, to our proposed system. In the evaluation with 42,856 malware samples, our proposed system collected 94% of novel HTTP requests and reduced analysis time by 82% in comparison with the system that continues all analyses.
Zhenyu ZHAO Ming ZHU Yiqiang SHENG Jinlin WANG
To solve the low accuracy problem of the recommender system for long term users, in this paper, we propose a top-N-balanced sequential recommendation based on recurrent neural network. We postulated and verified that the interactions between users and items is time-dependent in the long term, but in the short term, it is time-independent. We balance the top-N recommendation and sequential recommendation to generate a better recommender list by improving the loss function and generation method. The experimental results demonstrate the effectiveness of our method. Compared with a state-of-the-art recommender algorithm, our method clearly improves the performance of the recommendation on hit rate. Besides the improvement of the basic performance, our method can also handle the cold start problem and supply new users with the same quality of service as the old users.
David ALEDO Benjamin CARRION SCHAFER Félix MORENO
This paper describes the advantages and disadvantages observed when describing complex parameterizable Artificial Neural Networks (ANNs) at the behavioral level using SystemC and at the Register Transfer Level (RTL) using VHDL. ANNs are complex to parameterize because they have a configurable number of layers, and each one of them has a unique configuration. This kind of structure makes ANNs, a priori, challenging to parameterize using Hardware Description Languages (HDL). Thus, it seems intuitively that ANNs would benefit from the raise in level of abstraction from RTL to behavioral level. This paper presents the results of implementing an ANN using both levels of abstractions. Results surprisingly show that VHDL leads to better results and allows a much higher degree of parameterization than SystemC. The implementation of these parameterizable ANNs are made open source and are freely available online. Finally, at the end of the paper we make some recommendation for future HLS tools to improve their parameterization capabilities.
Lane detection or road detection is one of the key features of autonomous driving. In computer vision area, it is still a very challenging target since there are various types of road scenarios which require a very high robustness of the algorithm. And considering the rather high speed of the vehicles, high efficiency is also a very important requirement for practicable application of autonomous driving. In this paper, we propose a deep convolution neural network based lane detection method, which consider the lane detection task as a pixel level segmentation of the lane markings. We also propose an automatic training data generating method, which can significantly reduce the effort of the training phase. Experiment proves that our method can achieve high accuracy for various road scenes in real-time.
Chunxiao FAN Yang LI Lei TIAN Yong LI
This letter proposes a representation learning framework of convolutional neural networks (Convnets) that aims to rectify and improve the feature representations learned by existing transformation-invariant methods. The existing methods usually encode feature representations invariant to a wide range of spatial transformations by augmenting input images or transforming intermediate layers. Unfortunately, simply transforming the intermediate feature maps may lead to unpredictable representations that are ineffective in describing the transformed features of the inputs. The reason is that the operations of convolution and geometric transformation are not exchangeable in most cases and so exchanging the two operations will yield the transformation error. The error may potentially harm the performance of the classification networks. Motivated by the fractal statistics of natural images, this letter proposes a rectifying transformation operator to minimize the error. The proposed method is differentiable and can be inserted into the convolutional architecture without making any modification to the optimization algorithm. We show that the rectified feature representations result in better classification performance on two benchmarks.
Target recognition in Millimeter-wave Interferometric Synthetic Aperture Radiometer (MMW InSAR) imaging is always a crucial task. However, the recognition performance of conventional algorithms degrades when facing unpredictable noise interference in practical scenarios and information-loss caused by inverse imaging processing of InSAR. These difficulties make it very necessary to develop general-purpose denoising techniques and robust feature extractors for InSAR target recognition. In this paper, we propose a denoising convolutional neural network (D-CNN) and demonstrate its advantage on MMW InSAR automatic target recognition problem. Instead of directly feeding the MMW InSAR image to the CNN, the proposed algorithm utilizes the visibility function samples as the input of the fully connected denoising layer and recasts the target recognition as a data-driven supervised learning task, which learns the robust feature representations from the space-frequency domain. Comparing with traditional methods which act on the MMW InSAR output images, the D-CNN will not be affected by information-loss accused by inverse imaging process. Furthermore, experimental results on the simulated MMW InSAR images dataset illustrate that the D-CNN has superior immunity to noise, and achieves an outstanding performance on the recognition task.
Suofei ZHANG Bin KANG Lin ZHOU
Instance features based deep learning methods prompt the performances of high speed object tracking systems by directly comparing target with its template during training and tracking. However, from the perspective of human vision system, prior knowledge of target also plays key role during the process of tracking. To integrate both semantic knowledge and instance features, we propose a convolutional network based object tracking framework to simultaneously output bounding boxes based on different prior knowledge as well as confidences of corresponding Assumptions. Experimental results show that our proposed approach retains both higher accuracy and efficiency than other leading methods on tracking tasks covering most daily objects.
Lina GONG Shujuan JIANG Qiao YU Li JIANG
Heterogeneous defect prediction (HDP) is to detect the largest number of defective software modules in one project by using historical data collected from other projects with different metrics. However, these data can not be directly used because of different metrics set among projects. Meanwhile, software data have more non-defective instances than defective instances which may cause a significant bias towards defective instances. To completely solve these two restrictions, we propose unsupervised deep domain adaptation approach to build a HDP model. Specifically, we firstly map the data of source and target projects into a unified metric representation (UMR). Then, we design a simple neural network (SNN) model to deal with the heterogeneous and class-imbalanced problems in software defect prediction (SDP). In particular, our model introduces the Maximum Mean Discrepancy (MMD) as the distance between the source and target data to reduce the distribution mismatch, and use the cross-entropy loss function as the classification loss. Extensive experiments on 18 public projects from four datasets indicate that the proposed approach can build an effective prediction model for heterogeneous defect prediction (HDP) and outperforms the related competing approaches.
Gaofeng CHENG Pengyuan ZHANG Ji XU
The long short-term memory recurrent neural network (LSTM) has achieved tremendous success for automatic speech recognition (ASR). However, the complicated gating mechanism of LSTM introduces a massive computational cost and limits the application of LSTM in some scenarios. In this paper, we describe our work on accelerating the decoding speed and improving the decoding accuracy. First, we propose an architecture, which is called Projected Gated Recurrent Unit (PGRU), for ASR tasks, and show that the PGRU can consistently outperform the standard GRU. Second, to improve the PGRU generalization, particularly on large-scale ASR tasks, we propose the Output-gate PGRU (OPGRU). In addition, the time delay neural network (TDNN) and normalization methods are found beneficial for OPGRU. In this paper, we apply the OPGRU for both the acoustic model and recurrent neural network language model (RNN-LM). Finally, we evaluate the PGRU on the total Eval2000 / RT03 test sets, and the proposed OPGRU single ASR system achieves 0.9% / 0.9% absolute (8.2% / 8.6% relative) reduction in word error rate (WER) compared to our previous best LSTM single ASR system. Furthermore, the OPGRU ASR system achieves significant speed-up on both acoustic model and language model rescoring.
Zheng FANG Tieyong CAO Jibin YANG Meng SUN
Saliency detection is widely used in many vision tasks like image retrieval, compression and person re-identification. The deep-learning methods have got great results but most of them focused more on the performance ignored the efficiency of models, which were hard to transplant into other applications. So how to design a efficient model has became the main problem. In this letter, we propose parallel feature network, a saliency model which is built on convolution neural network (CNN) by a parallel method. Parallel dilation blocks are first used to extract features from different layers of CNN, then a parallel upsampling structure is adopted to upsample feature maps. Finally saliency maps are obtained by fusing summations and concatenations of feature maps. Our final model built on VGG-16 is much smaller and faster than existing saliency models and also achieves state-of-the-art performance.
Hiroshi SEKI Kazumasa YAMAMOTO Tomoyosi AKIBA Seiichi NAKAGAWA
Deep neural networks (DNNs) have achieved significant success in the field of automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation under limited available data remains a major challenge for DNN-based systems because of their enormous free parameters. In this paper, we propose a filterbank-incorporated DNN that incorporates a filterbank layer that presents the filter shape/center frequency and a DNN-based acoustic model. The filterbank layer and the following networks of the proposed model are trained jointly by exploiting the advantages of the hierarchical feature extraction, while most systems use pre-defined mel-scale filterbank features as input acoustic features to DNNs. Filters in the filterbank layer are parameterized to represent speaker characteristics while minimizing a number of parameters. The optimization of one type of parameters corresponds to the Vocal Tract Length Normalization (VTLN), and another type corresponds to feature-space Maximum Linear Likelihood Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Since the filterbank layer consists of just a few parameters, it is advantageous in adaptation under limited available data. In the experiment, filterbank-incorporated DNNs showed effectiveness in speaker/gender adaptations under limited adaptation data. Experimental results on CSJ task demonstrate that the adaptation of proposed model showed 5.8% word error reduction ratio with 10 utterances against the un-adapted model.
Shengchang LAN Zonglong HE Weichu CHEN Kai YAO
In order to provide an alternative solution of human machine interfaces, this paper proposed to recognize 10 human hand gestures regularly used in the consumer electronics controlling scenarios based on a three-dimensional radar array. This radar array was composed of three low cost 24GHz K-band Doppler CW (Continuous Wave) miniature I/Q (In-phase and Quadrature) transceiver sensors perpendicularly mounted to each other. Temporal and spectral analysis was performed to extract magnitude and phase features from six channels of I/Q signals. Two classifiers were proposed to implement the recognition. Firstly, a decision tree classifier performed a fast responsive recognition by using the supervised thresholds. To improve the recognition robustness, this paper further studied the recognition using a two layer CNN (Convolutional Neural Network) classifier with the frequency spectra as the inputs. Finally, the paper demonstrated the experiments and analysed the performances of the radar array respectively. Results showed that the proposed system could reach a high recognition accurate rate higher than 92%.
Ippei HAMAMOTO Masaki KAWAMURA
An autoencoder has the potential ability to compress and decompress information. In this work, we consider the process of generating a stego-image from an original image and watermarks as compression, and the process of recovering the original image and watermarks from the stego-image as decompression. We propose embedder and extractor neural networks based on the autoencoder. The embedder network learns mapping from the DCT coefficients of the original image and a watermark to those of the stego-image. The extractor network learns mapping from the DCT coefficients of the stego-image to the watermark. Once the proposed neural network has been trained, the network can embed and extract the watermark into unlearned test images. We investigated the relation between the number of neurons and network performance by computer simulations and found that the trained neural network could provide high-quality stego-images and watermarks with few errors. We also evaluated the robustness against JPEG compression and found that, when suitable parameters were used, the watermarks were extracted with an average BER lower than 0.01 and image quality over 35 dB when the quality factor Q was over 50. We also investigated how to represent the watermarks in the stego-image by our neural network. There are two possibilities: distributed representation and sparse representation. From the results of investigation into the output of the stego layer (3rd layer), we found that the distributed representation emerged at an early learning step and then sparse representation came out at a later step.
Mengbo ZHANG Lunwen WANG Yanqing FENG Haibo YIN
Spectrum sensing is the first task performed by cognitive radio (CR) networks. In this paper we propose a spectrum sensing algorithm for orthogonal frequency division multiplex (OFDM) signal based on deep learning and covariance matrix graph. The advantage of deep learning in image processing is applied to the spectrum sensing of OFDM signals. We start by building the spectrum sensing model of OFDM signal, and then analyze structural characteristics of covariance matrix (CM). Once CM has been normalized and transformed into a gray level representation, the gray scale map of covariance matrix (GSM-CM) is established. Then, the convolutional neural network (CNN) is designed based on the LeNet-5 network, which is used to learn the training data to obtain more abstract features hierarchically. Finally, the test data is input into the trained spectrum sensing network model, based on which spectrum sensing of OFDM signals is completed. Simulation results show that this method can complete the spectrum sensing task by taking advantage of the GSM-CM model, which has better spectrum sensing performance for OFDM signals under low SNR than existing methods.